The learning difficulties described by students in statistics courses continue to engage researchers from 
several disciplines. One source of difficulty for graduate students in educational statistics courses is the reading 
difficulty of the textbook. Unfortunately, instructors pondering text adoption decisions typically have little 
information about the reading difficulty of a text. Although reading difficulty is almost always assessed using a 
reading difficulty formula, these formulas have not been evaluated for use with graduate students majoring in the 
social sciences. An evaluation study was done to examine the extent to which five reading difficulty formulas often 
used with K-12 populations consistently and validly differentiate among introductory statistics texts used in a school 
or college of education. Preliminary findings suggest that these formulas are of only modest utility in assessing the 
reading difficulty of statistics texts, and that the development of new formulas is indicated. 

Introduction 

The learning difficulties described by students in introductory statistics courses have not gone unnoticed, and 
the statistical education literature has identified several factors that are associated with a student's perception of 
statistics and with their success in these courses. One of the more prominent of these factors is the text used in the 
course (Cobb, 1987). 

Harwell, Herrick, Curtis, Mundfrom, and Gold (1996) documented the need for an objective method for 
evaluating introductory statistics texts and offered a literature-based evaluation framework that could be used to 
assist instructors struggling with course adoption decisions and journal reviewers of such texts. Construction of 
paper-and-pencil instruments by these authors was guided by the finding that college students (and, presumably, 
graduate students) possess sufficient knowledge, ability, and motivation to overcome deficiencies in the style of a 
text if the content meets the needs of a course, particularly if the course is outside a student's major (Redei, 1984). 
This finding is especially relevant for students taking introductory statistics in a school or college of education 
because most are not majoring in statistics. 

A key finding of Harwell, et al. was that the reading difficulty of the text appeared to play an important role in 
students' motivation and perception of the course. Students who felt their statistics text was difficult to read were 
frustrated and generally felt the text was of little help, whereas students who felt the reading level of the text was 
"about right" were less frustrated and generally believed the text was useful. These results were consistent with 
work by Major (1955) and Schwartz, Sparkman, and Deese (1970) that suggested that readers could provide valuable 
information about text readability. Harwell, et al. also suggested that comparisons of the reading difficulty of 
introductory statistics texts would provide important evaluative information. 

Ideally, reading difficulty would be assessed using a test constructed for that purpose, but the burden of 
developing such instruments has led to the use of reading difficulty formulas as a substitute. Unfortunately, 
available formulas for assessing reading difficulty typically target K-12 populations and pay only modest attention to 
college students and even less to graduate students. Moreover, these formulas are rarely used for numerically 
oriented texts, which suggests that they may not be generalizable to introductory statistics texts. The apparent 
differences encountered by readers of, for example, a history text versus a statistics text with its numerous equations 
and computations make such generalizations suspect. In any event, the extent to which the formulas agree among 
themselves in rank-ordering the reading difficulty of introductory statistics texts in education (i.e., consistency), and 
the validity with which they differentiate among these texts, are not known. 

Statement of the Problem 

The research question of interest was: How consistently and validly do several reading difficulty formulas 
differentiate between the reading difficulty of introductory statistics texts used in education? The results may allow 
one or more of these formulas to be recommended for use in course adoption decisions and to assist journal 
reviewers of these texts. On the other hand, if none of the reading difficulty formulas show evidence of consistency 
(i.e., agreement) and validity, development of new and improved formulas would be indicated. 

Review of the Literature 

Evaluating reading difficulty has a long history. For example, Farr, Jenkins, and Paterson (1951) evaluated the 
reading difficulty of employee handbooks for a large North American car manufacturer, Roit (1984) examined the 
reading difficulty of government documents provided to parents of handicapped children, Plake (1988) studied the 
reading difficulty of various licensure and certification materials, Prout (1989) examined the reading difficulty of 
selected child and adolescent self-report measures, Visser (1994) evaluated the reading difficulty of AIDS/HIV 
education materials, and Giordano (1985), Hill (1984), Quereshi (1991), and Redei (1984) assessed the reading 
difficulty of educational texts for K-12 and college students. 

Assessing reading difficulty has traditionally been done using a reading difficulty formula or RDF, which 
typically measures, directly or indirectly, sentence length and word difficulty (Meyer, Marsiske, & Willis, 1993). The 
intent is to use the scores produced by a RDF to distinguish between texts of varying reading difficulty, allowing a 
more accurate match between texts and students. The practical value of a RDF was described by Klare (1974) as "... 
a predictive device in the sense that no actual participation by readers is needed" (p. 64). Klare emphasized that 
these formulas do not indicate why something is more or less difficult to read. 

A RDF typically involves a linear combination of a small number of predictors weighted by estimated 
regression coefficients obtained from a norming study in which a proxy for reading difficulty, such as a standardized 
reading comprehension test, serves as the dependent variable. Klare noted that more than 30 RDFs are available, 
and offered evidence that sentence length and word difficulty are often sufficient to make reasonably good 
predictions about readability. 

Perhaps the most commonly used RDF is that due to Flesch (1948), which has the form: 

Reading Ease (RE) = 206.835 - .846 WL -1.015 SL (1) 

where WL is the number of syllables per 100 words, SL is the average number of words per sentence, and WL and SL are 
predictor variables in a multiple regression equation (Klare, 1974). The dependent variable used in developing the 
regression equation was a test of reading comprehension. In evaluating reading difficulty for a text, one simply 
computes values for SL and WL for an arbitrary number of pages in the text, averages these values, and substitutes 
the averages into the above equation. Lower (possibly negative) values of Reading Ease indicate easier-to-read 
material. Other reading difficulty formulas given positive reviews by Klare included the following: 



New Reading Ease (NRE) = 1.599 (# of one-syllable words per 100 words) 
     - 1.015 (average sentence length) - 31.517   (Farr, et al., 1951)   (2) 

Danielson & Bryan (DB) = 1.0364 (# of characters per space) 
     + .0194 (# of characters per sentence) - .6059   (Danielson & Bryan, 1963)   (3) 

FOG1 = (.4) average sentence length + % words with ≥ 3 syllables   (Gunning, 1942)   (4) 

FOG2 = (.4) average sentence length + % words with ≥ 4 syllables   (5) 

In all cases, large positive scores indicate more difficult material. 
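
As a minimal sketch of how Equations 1-5 might be applied, the Python functions below evaluate each formula from quantities averaged over the sampled pages. The function names, argument names, and the input values in the usage lines are illustrative assumptions and do not come from the texts examined in this study. FOG1 and FOG2 share one function because, as printed, they differ only in the syllable threshold used to compute the percentage of long words.

```python
from statistics import mean


def reading_ease(wl, sl):
    """Equation 1: wl = syllables per 100 words, sl = words per sentence."""
    return 206.835 - 0.846 * wl - 1.015 * sl


def new_reading_ease(one_syllable_per_100, sl):
    """Equation 2: one-syllable words per 100 words and average sentence length."""
    return 1.599 * one_syllable_per_100 - 1.015 * sl - 31.517


def danielson_bryan(chars_per_space, chars_per_sentence):
    """Equation 3: characters per space and characters per sentence."""
    return 1.0364 * chars_per_space + 0.0194 * chars_per_sentence - 0.6059


def fog(sl, pct_long_words):
    """Equations 4 and 5 as printed; pct_long_words is a percentage (0-100)."""
    return 0.4 * sl + pct_long_words


def reading_ease_for_text(pages):
    """Average WL and SL over the sampled pages, then substitute into Equation 1."""
    wl = mean(p[0] for p in pages)
    sl = mean(p[1] for p in pages)
    return reading_ease(wl, sl)


# Hypothetical (WL, SL) pairs for two sampled pages; not data from the study.
pages = [(150, 20), (160, 22)]
print(round(reading_ease_for_text(pages), 3))
print(round(new_reading_ease(70, 20), 3))
print(round(danielson_bryan(5, 100), 4))
print(round(fog(20, 15), 1))
```

The page-averaging helper is shown only for Equation 1; the same averaging step would apply to the inputs of the other formulas.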



Klare emphasized that the validity of these formulas must come from evaluation studies, yet few evaluation 
studies have been done and apparently none for populations of graduate students majoring in the social sciences. 
For this and other reasons these formulas have been criticized (Dryer, 1984), yet still appear to be widely used (e.g., 
Meyer, et al., 1993; Plake, 1988). 



Methodology 

The purpose of the study was to evaluate the consistency and validity with which various reading difficulty 
formulas differentiated among introductory statistics texts used in a school or college of education. The target 
population was graduate students who take an introductory statistics course as part of their training. The study was 
conducted in two stages. 

First, the extent to which the reading difficulty formulas consistently rank-ordered the reading difficulty of the 
texts was investigated. In experimental design terms, two independent variables were manipulated in this stage: 
textbooks (N = 10) and number of pages used to extract readability information (J = 3 or 6). Each RDF served as a 
dependent variable. 

The texts which were selected were judged to be the most frequently used in introductory statistics classes by 
an informal survey of several members of the Educational Statisticians, a Special Interest Group within the American 
Educational Research Association. Each text was divided into J = 3 or 6 sections of approximately equal size and from 
each section a page was randomly chosen. Although J = 3 is probably the most commonly used value, two values of 
J were used since there does not appear to be any agreed upon minimum number that should be used in extracting 
readability information. Using multiple pages per book (in effect, replications) is also consistent with the 
recommendation of Fitzgerald (1981). 
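
The page-selection step described above can be sketched as follows. This is only an illustration under stated assumptions: pages are numbered 1 to n_pages, sections are contiguous blocks of pages, and the function name `sample_pages` and its `seed` argument are hypothetical conveniences rather than part of the study's procedure.

```python
import random


def sample_pages(n_pages, j, seed=None):
    """Split pages 1..n_pages into j contiguous sections of nearly equal
    size and draw one page at random from each section."""
    if j < 1 or j > n_pages:
        raise ValueError("j must be between 1 and n_pages")
    rng = random.Random(seed)
    base, extra = divmod(n_pages, j)
    pages = []
    start = 1
    for section in range(j):
        # The first `extra` sections get one additional page.
        size = base + (1 if section < extra else 0)
        pages.append(rng.randint(start, start + size - 1))
        start += size
    return pages


# Hypothetical 600-page text sampled under both conditions used in the study.
print(sample_pages(600, 3, seed=1))
print(sample_pages(600, 6, seed=1))
```

Drawing one page per section spreads the sampled pages across the whole text, which is the purpose of dividing the text before sampling.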

On each of the J randomly selected pages, the numbers of syllables, 1-syllable words, 3-syllable words, and 4-syllable 
words were counted for the first 100 words in a randomly selected paragraph on that page and recorded by 
the co-author, a graduate student majoring in research methodology. In addition, the number of characters and 
number of sentences within each 100-word block were recorded. The coded variables represented terms needed by 
the various RDFs. The quality of the coding was checked by having the first author select one page at random from 
each of 4 texts randomly selected from the 10, and code the quantities used to compute the RDFs. Comparing the 
two sets of codings produced a simple percent agreement value virtually equal to 100%, suggesting that the coded 
values were accurate. 

For each RDF, 3 scores were obtained for each text for the J = 3 condition and 6 scores for the J = 6 condition. 
An examination of the means, medians, standard deviations, etc. of the 10 (text) x 2 (number of pages) data matrix 
for each RDF suggested that there was little difference between using 3 versus 6 pages, and subsequent analyses 
used the J = 6 data. 

Results 

Consistency of the Reading Difficulty Formulas 

The consistency or agreement among the 5 formulas in rank-ordering the reading difficulty of the texts was 
investigated in three ways. First, the reading difficulty scores were plotted against texts (see Figures 1-5). The plots 
for three sets of scores (DB, FOG1, FOG2) appear to be similar and somewhat different from the plots for RE and 
NRE. The DB, FOG1, and FOG2 plots also suggest that there are three groupings of texts: those more difficult to read 
(texts 2, 4-7), those easier to read (texts 3, 9), with the remaining texts being of average reading difficulty. Because the 
RE and NRE results were quite similar, only NRE results are reported from here on. 

Summary statistics for the reading difficulty scores by text are reported in Table 1 as z-scores. The reading 
difficulty variables were treated as linear transformations of one another because they are, by definition, attempting 
to measure the same thing (i.e., reading difficulty), use many of the same terms in their calculation (e.g., average 
sentence length), and showed between-formula correlations exceeding .70. In addition, a principal-axis factoring 
with oblique rotation suggested that a single factor seemed to underlie the RDFs. This factor accounted for 85% of 
the variance with factor loadings that exceeded .9 for every RDF. 

Each value in Table 1 is the average of 6 ratings (one from each of 6 pages) for an RDF. Averages close to zero 
indicate that the text's reading difficulty rating by a RDF is near the mean of all such ratings assigned to the texts by 
that RDF; a large negative value means that the text was judged to be easier to read than other texts by the RDF, and a 
large positive value means that the text was judged more difficult to read. The value in parentheses is the standard 
deviation of the 6 ratings. 
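
One way the tabled values could be produced is sketched below: the ratings from a single RDF are standardized against all ratings assigned by that RDF, and the z-scores are then summarized within each text. The use of the sample standard deviation (n - 1 denominator), the function name, and the raw scores are assumptions made for illustration only.

```python
from statistics import mean, stdev


def summarize_z_scores(scores_by_text):
    """Return {text: (mean z-score, SD of z-scores)} for one formula."""
    pooled = [s for scores in scores_by_text.values() for s in scores]
    grand_mean = mean(pooled)
    grand_sd = stdev(pooled)  # assumption: sample SD with n - 1 denominator
    summary = {}
    for text, scores in scores_by_text.items():
        z = [(s - grand_mean) / grand_sd for s in scores]
        summary[text] = (mean(z), stdev(z))
    return summary


# Hypothetical raw scores: two texts with three sampled pages each.
raw = {"Text A": [1.0, 2.0, 3.0], "Text B": [4.0, 5.0, 6.0]}
for text, (z_mean, z_sd) in summarize_z_scores(raw).items():
    print(f"{text}: {z_mean:.2f} ({z_sd:.2f})")
```

Because the standardization pools all ratings for a formula, the text means are centered on zero, which is what allows values from different formulas to be compared on a common scale.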

The results in Table 1 suggest that the agreement among the DB, FOG1, and FOG2 formulas is quite good for 
several of the texts and somewhat poor for others. For example, the Toothaker text was rated as quite easy to read by 
DB, FOG1, and FOG2; similarly, the Hays text was rated fairly difficult to read by these same RDFs. On the other 
hand, DB and FOG1 rated the Glass and Hopkins and Howell texts as being of average reading difficulty whereas 
FOG2 rated them as more difficult than average. One consistent pattern in Table 1 is that NRE almost always 
disagreed with the other RDFs. For example, Hays was rated as relatively easy to read and the Hinkle, Wiersma, and 
Jurs text as more difficult to read by NRE, whereas the other RDFs rated these two texts in the opposite direction. 

On the whole, the DB, FOG1, and FOG2 RDFs strongly agreed on the reading level of only 4 of the 10 texts (Hays; 
Hinkle, Wiersma, & Jurs; Popham & Sirotnik; Toothaker). 

Table 1 is instructive but additional analysis of the ratings was done to further summarize the findings. 
Because gauging the magnitude of differences among average z-scores in Table 1 can be difficult (for example, do 
average z-scores of .10, .33, .37, and .04 for a text suggest agreement?), the values in Table 1 were ranked from 1 to 10 
within each column, where a rank of 1 indicated a text whose reading level was more difficult than average. 
Agreement was assessed by examining the rankings within texts, across pairs of RDFs. Pairs of rankings were said to 
agree exactly if the ranks assigned to the texts by two RDFs were identical. With a possible range of 0-100%, the 
exact agreement computed for pairs of RDFs reported in Table 2 was 28.3% (17/60, where 60 = 6 pairs x 10 texts). 

The definition of agreement was then relaxed to allow pairs of rankings within a ±1 range to be considered to be in 
agreement. Thus, if the rankings of a text assigned by two RDFs were 3 and 4, they would be considered to agree. 
With this definition, the overall agreement rate among the RDFs was 45% (27/60); if the definition of agreement is 
further relaxed to ± 2, the overall agreement rate was 60%. Percent agreements for pairs of RDFs for the ±1 and ±2 
agreement criteria are reported in Table 2. The strongest agreement for all three definitions of agreement was 
between FOG1 and FOG2, and in general the poorest agreement involved NRE. 
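
The three agreement criteria can be expressed as a single tolerance on the difference between two ranks. The sketch below assumes each formula's ranks are stored in the same text order; the function names and the ranks themselves are hypothetical and are not the rankings obtained in this study.

```python
from itertools import combinations


def pair_agreement(ranks_a, ranks_b, tolerance=0):
    """Proportion of texts whose two ranks differ by at most `tolerance`."""
    hits = sum(abs(a - b) <= tolerance for a, b in zip(ranks_a, ranks_b))
    return hits / len(ranks_a)


def overall_agreement(ranks_by_formula, tolerance=0):
    """Pool the agreement counts over every pair of formulas."""
    hits = 0
    total = 0
    for name_a, name_b in combinations(ranks_by_formula, 2):
        ranks_a = ranks_by_formula[name_a]
        ranks_b = ranks_by_formula[name_b]
        hits += sum(abs(a - b) <= tolerance for a, b in zip(ranks_a, ranks_b))
        total += len(ranks_a)
    return hits / total


# Hypothetical ranks of five texts under three formulas.
ranks = {"X": [1, 2, 3, 4, 5], "Y": [1, 3, 5, 4, 2], "Z": [2, 1, 3, 5, 4]}
for tol in (0, 1, 2):
    print(tol, round(overall_agreement(ranks, tol), 3))
```

With four formulas and ten texts, the pooled denominator is 6 pairs x 10 texts = 60 comparisons, matching the denominators reported above.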

Next, a one-way analysis of variance was performed for each RDF, treating Texts as a random effect. The 
results of these analyses are reported in Table 3. Statistically, NRE did not differentiate among the 10 texts whereas 
the remaining RDFs did with similar efficiency. The square root of the estimated variance components suggests that the 
spread of the reading difficulties in the population of texts is similar for DB, FOG1, and FOG2, and noticeably 
different for NRE. 

The results reported in Tables 1-3 suggest that the strongest agreement was between FOG1 and FOG2 
followed by FOG1 and DB, and FOG2 and DB. As suggested by the results in Tables 1 and 2, the poorest agreement 
was between NRE and the remaining formulas. Similarly, the only statistically non-significant result for the 
ANOVAs was for NRE. Thus, there is evidence that the DB, FOG1, and FOG2 indices produce similar rank-orderings 
of reading difficulty. The NRE formula, on the other hand, showed noticeably less agreement with the 
other formulas. Still, the overall consistency among DB, FOG1, and FOG2 for the more relaxed definition of 
agreement was, at best, moderate. 
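
For readers who wish to reproduce the kind of analysis summarized in Table 3, the sketch below computes the F ratio and the square root of the between-text variance component for a balanced one-way layout. It assumes a method-of-moments estimate truncated at zero, which is one common choice and not necessarily the estimator used in this study; the function name and data are hypothetical, and the p-value is omitted because it requires an F distribution routine outside the standard library.

```python
from math import sqrt
from statistics import mean


def one_way_random_anova(groups):
    """Balanced one-way ANOVA with groups (texts) as a random effect.

    Returns (F ratio, square root of the between-group variance component).
    """
    k = len(groups)
    n = len(groups[0])  # assumes every text has the same number of pages
    grand = mean(x for g in groups for x in g)
    ms_between = n * sum((mean(g) - grand) ** 2 for g in groups) / (k - 1)
    ms_within = sum((x - mean(g)) ** 2 for g in groups for x in g) / (k * (n - 1))
    f_ratio = ms_between / ms_within
    # Method-of-moments estimate; set to zero when MS_between < MS_within.
    component = max((ms_between - ms_within) / n, 0.0)
    return f_ratio, sqrt(component)


# Hypothetical scores: three texts with three sampled pages each.
scores = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_ratio, root_component = one_way_random_anova(scores)
print(round(f_ratio, 2), round(root_component, 2))
```

In the design used here, each call would receive 10 groups of 6 scores, giving 9 and 50 degrees of freedom for the F test.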

Preliminary Validity Findings 

The results in Tables 1-3 bear on the consistency with which the RDFs rank-ordered the reading difficulty of 
the texts, but do not provide information about the correctness or validity of the rankings. Preliminary validity 
evidence was generated by comparing the rankings of text reading difficulty generated by the RDFs with rankings 
produced by content experts. The process used to assess validity had two components: analysis of data from the 
Harwell, et al. study and collection of data from additional content experts in statistical education. 

The Harwell, et al. study evaluated several statistical texts, 4 of which were also examined in our study. 
Harwell, et al. recruited a total of 6 instructors (content experts) with 3 or more years of experience teaching statistics 
who were asked to rate the reading difficulty of texts using items from an instrument piloted by Harwell, et al. 
Approximately 253 students spanning four universities in the United States were also recruited in the Harwell, et al. 
study and asked to respond to virtually the same questionnaire items given to instructors. These items included 
"How would you describe the reading level of the textbook?" (1=very easy, 5=very difficult), "The text is easy to 
understand and follow" (1=strongly disagree, 5=strongly agree), and a series of check-off questions, such as "The 
writing style is " (terse, wordy, simple, balanced, etc.). 

Responses to the above items were used to provide validity evidence. The rationale was that a text which was 
ranked as difficult to read by the content experts and a RDF provides validity evidence because the RDF rankings 
agree with those of identified content experts. For example, suppose a RDF ranked one of the texts used in Harwell, 
et al. (say, text A) as the most difficult to read and another text (say, text B) as the easiest to read. If the content expert 
data indicated that, among the texts, A was the most difficult to read and B was the easiest, that would serve as 
validity evidence for that RDF. An analysis of student data that also suggested that A was the most difficult to read 
and B the easiest would provide some validity evidence, although the content expert data would carry more weight. 

Analysis of the Harwell, et al. data for 4 of the 10 texts examined in this paper is given in Table 4. Although 
the data are preliminary, these results show little agreement with the rankings derived from Table 1. For example, Table 1 
suggests that the Popham and Sirotnik text is among the most difficult to read whereas two content experts thought 
it was among the easiest to read. Similarly, the Table 1 rankings suggest the Hays and Glass and Hopkins texts were 
of approximately average reading difficulty for the sample of 10 texts whereas the content experts thought they were 
relatively difficult to read. Additional validity data are currently being collected from newly recruited content 
experts for all 10 texts examined in this study. Unfortunately, the latter data are not yet complete and our validity 
findings are necessarily preliminary. 



Conclusions and Implications 

Our results so far raise concerns about applying traditional reading difficulty formulas to graduate-level 
statistics texts. The agreement among these formulas in rank-ordering texts appears to be, at best, moderate, and 
preliminary validity findings are not encouraging. The information from the current study will generate specific 
recommendations for using one or more of the reading difficulty formulas (or possibly none). Should the 
preliminary findings hold up, then new reading difficulty formulas targeting statistics texts would need to be 
developed. 

Table 1 

Descriptive Statistics for Reading Difficulty Formula by Text 

Text                                    NRE            DB             FOG1           FOG2 

Glass & Hopkins                         -.36 (.69)     -.02 (.74)     .05 (.78)      .34 (.53) 
Hays                                    -.47 (1.42)    .35 (1.45)     .39 (.88)      .38 (1.47) 
Hinkle, Wiersma, & Jurs                 .38 (.92)      -.36 (.57)     -.7 (.91)      -.64 (.9) 
Howell                                  .10 (1.5)      .33 (.56)      .37 (.67)      .04 (.94) 
Popham & Sirotnik                       -.78 (1)       .33 (1.22)     .41 (1.01)     .54 (1.17) 
Marascuilo & Serlin                     -.19 (.91)     .46 (1.15)     .31 (1.13)     .20 (.7) 
Freedman, Pisani, Purves, & Adhikari    .15 (.53)      .2 (.83)       .38 (.95)      .27 (.97) 
Mosteller, Fienberg, & Rourke           .07 (.85)      .07 (.69)      .18 (.92)      .19 (.88) 
Toothaker                               .73 (.72)      -1.46 (.82)    -1.37 (.83)    -1.26 (.78) 
Shavelson                               .36 (.57)      .16 (.54)      0 (.83)        -.06 (.47) 

Note. NRE, DB, FOG1, and FOG2 are the four reading difficulty formulas 
investigated in the current study. The tabled values represent the ratings of the 
formulas for the texts expressed as z-scores and the values in parentheses represent 
the standard deviation of the ratings. 




Table 2 

Percent Agreement Among Reading Difficulty Formulas 

            Percent Agreement Among Rankings 

            NRE        DB         FOG1       FOG2 

NRE                    30 (40)    40 (50)    20 (40) 
DB                                50 (80)    50 (70) 
FOG1                                         80 (80) 

Note. NRE, DB, FOG1, and FOG2 represent the four reading 
difficulty formulas. The 30% agreement rate for the NRE/DB pairing 
indicates that when rankings were considered to agree if they 
were within ± 1, these two reading difficulty formulas agreed on 3 
out of 10 texts or 30%; if the definition of agreement is ± 2, the two 
formulas agreed on 4 out of the 10 texts or an agreement rate of 40%. 



Table 3 

One-way Analysis of Variance Results 

Readability 
Formula        F       p       a 

NRE            1.27    .28 
DB             2.35    .027    .93 
FOG1           2.54    .017    .94 
FOG2           2.11    .046    .95 

Note. NRE, DB, FOG1, and FOG2 represent 
the four reading difficulty formulas. F = 
computed F test, p = p-value, a = square root 
of the estimated variance component. 




Table 4 

Summary of the Harwell, et al. content expert data for selected items 

                                                  Text 

Item                                      GH     HWJ    MS     PS 

Reading level 
    About right                           3/4    2/3    1/1    1/1 
    Difficult                             1/4    1/3    0/1    0/1 
Text easy to understand and follow 
    Agree                                 2/4    2/3    1/1    1/1 
Writing style understandable 
    Agree                                 1/4    1/3    1/1    0/1 

Note. GH = Glass and Hopkins, HWJ = Hinkle, Wiersma, and Jurs, MS = Marascuilo 
and Serlin, PS = Popham and Sirotnik. In the first row of data in the table, the 3/4 
means that 3 out of 4 content experts thought that the reading level of the Glass and Hopkins text 
was about right, whereas the remaining content expert thought its reading level was 
too difficult. 